Now that we've seen how to run statistical testing and create supervised machine learning models in Python, it's time for you to apply this knowledge. This week has three challenges. Make sure to give it a try and complete all of them.
Some important notes for the challenges:
We are constantly monitoring the issues on the GitHub general repository (https://github.com/uva-cw-digitalanalytics/2021s2/issues) to help you out. Don't hesitate to log an issue there, explaining well what the problem is, showing the code you are using, and the error message you may be receiving.
Important: We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day. This means you should now wait for our response before submitting a challenge :-)
We will use the Google Store data that we also saw in the video tutorials. Make sure to either have it by cloning the general repository, or downloading it from surfdrive (see link in the General Repository homepage) and placing it in the same folder as you are running this weekly challenge.
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LogisticRegression, LinearRegression
import statsmodels.api as sm
import numpy as np
import lime
from lime import lime_tabular
Our website has launched new campaigns to increase sales (as binary, converted from order_euros) and revenue (order_euros).
We are interested in two campaigns:
We want to know if (a) each campaign led to an increase in sales compared to the other campaigns (i.e., any traffic source that is not set as CPC or referral) and, (b) if one campaign led to more sales than the other.
Both dependent variables (sales and revenue) should come from the order_euros variable.
We also want to understand how the device that someone has, and the location that someone is in, influence sales and revenue. This means you need also to create two additional independent variables:
Because the dataset is very large and it may take some time to run the code, we will select a random sample of 10% of the visits that are in the dataset. Please run the code below (exactly as it is):
import pandas as pd
visits = pd.read_pickle('googlestore_DA5weeklychallenges.pkl')
len(visits)
52308
# Selecting a sample of 10% visits. The "random_state" option ensures that everyone has the same data
visits = visits.sample(frac=0.1, random_state = 42)
len(visits)
5231
Create a RQ and hypothesis (or hypotheses) based on the case description above, and prepare the dataset and the variables needed to answer the RQ and hypothesis.
When everything is done:
For sales (i.e., whether someone made a purchase or not), you will need to transform a continuous variable (order_euros) into a binary variable (0 = no purchase, 1 = purchase).
Before creating RQs and hypotheses, I want to inspect the dataset first.
visits.head()
| | affiliate | channelGrouping | cpc | date | device_browser | device_deviceCategory | device_isMobile | device_operatingSystem | fullVisitorId | geoNetwork_city | ... | trafficSource_adwordsClickInfo_slot | trafficSource_campaign | trafficSource_isTrueDirect | trafficSource_keyword | trafficSource_medium | trafficSource_referralPath | trafficSource_source | visitId | visitNumber | visitStartTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36548 | 0 | Referral | 0 | 20171010 | Chrome | desktop | False | Macintosh | 6413139757155395885 | Austin | ... | NaN | (not set) | NaN | NaN | (none) | / | (direct) | 1507650614 | 1 | 1507650614 |
| 26376 | 0 | Display | 1 | 20171007 | Chrome | mobile | True | Android | 1122408680230456408 | NaN | ... | RHS | 1000557 \| GA \| US \| en \| Hybrid \| GDN Text+Ban... | NaN | (User vertical targeting) | cpc | NaN | | 1507418877 | 1 | 1507418877 |
| 21940 | 0 | Organic Search | 0 | 20171006 | Chrome | desktop | False | Macintosh | 5963631948501878610 | Mexico City | ... | NaN | (not set) | NaN | (not provided) | organic | NaN | | 1507316316 | 1 | 1507316316 |
| 8907 | 0 | Organic Search | 0 | 20171003 | Chrome | mobile | True | Android | 233830236249633791 | NaN | ... | NaN | (not set) | NaN | (not provided) | organic | NaN | | 1507067994 | 1 | 1507067994 |
| 36441 | 0 | Direct | 0 | 20171010 | Chrome | desktop | False | Linux | 6912427824452801459 | Cambridge | ... | NaN | (not set) | True | NaN | (none) | NaN | (direct) | 1507648677 | 2 | 1507648677 |
5 rows × 79 columns
visits.columns
Index(['affiliate', 'channelGrouping', 'cpc', 'date', 'device_browser',
'device_deviceCategory', 'device_isMobile', 'device_operatingSystem',
'fullVisitorId', 'geoNetwork_city', 'geoNetwork_continent',
'geoNetwork_country', 'geoNetwork_metro', 'geoNetwork_networkDomain',
'geoNetwork_region', 'geoNetwork_subContinent', 'isExit',
'landing_appInfo_landingScreenName', 'landing_appInfo_screenDepth',
'landing_appInfo_screenName', 'landing_contentGroup_contentGroup1',
'landing_contentGroup_contentGroup2',
'landing_contentGroup_contentGroup3',
'landing_contentGroup_contentGroup4',
'landing_contentGroup_contentGroup5', 'landing_hour',
'landing_isEntrance', 'landing_isExit', 'landing_minute',
'landing_page_hostname', 'landing_page_pagePath',
'landing_page_pagePathLevel1', 'landing_page_pagePathLevel2',
'landing_page_pagePathLevel3', 'landing_page_pagePathLevel4',
'landing_page_pageTitle', 'landing_product_isClick',
'landing_product_isImpression', 'landing_product_productBrand',
'landing_product_productListName',
'landing_product_productListPosition', 'landing_product_productPrice',
'landing_product_productQuantity', 'landing_product_productSKU',
'landing_product_productVariant', 'landing_product_v2ProductCategory',
'landing_product_v2ProductName',
'landing_promotionActionInfo_promoIsView',
'landing_promotion_promoCreative', 'landing_promotion_promoId',
'landing_promotion_promoName', 'landing_promotion_promoPosition',
'landing_referer', 'landing_social_hasSocialSourceReferral',
'landing_social_socialInteractionNetworkAction',
'landing_social_socialNetwork', 'order_euros', 'referral',
'totals_bounces', 'totals_newVisits', 'totals_pageviews',
'totals_timeOnSite', 'totals_transactionRevenue', 'totals_transactions',
'trafficSource_adContent',
'trafficSource_adwordsClickInfo_adNetworkType',
'trafficSource_adwordsClickInfo_gclId',
'trafficSource_adwordsClickInfo_isVideoAd',
'trafficSource_adwordsClickInfo_page',
'trafficSource_adwordsClickInfo_slot', 'trafficSource_campaign',
'trafficSource_isTrueDirect', 'trafficSource_keyword',
'trafficSource_medium', 'trafficSource_referralPath',
'trafficSource_source', 'visitId', 'visitNumber', 'visitStartTime'],
dtype='object')
#check the category of device
visits['device_operatingSystem'].value_counts()
Android 1760
Windows 1401
Macintosh 1047
iOS 710
Linux 157
Chrome OS 121
(not set) 29
Windows Phone 3
Tizen 2
Samsung 1
Name: device_operatingSystem, dtype: int64
#test the category of location
visits['geoNetwork_country'].value_counts()
United States 2510
United Kingdom 308
India 274
Canada 162
Mexico 129
...
Zimbabwe 1
Burundi 1
Tanzania 1
Myanmar (Burma) 1
Cambodia 1
Name: geoNetwork_country, Length: 125, dtype: int64
Because I want to explore what influences sales and revenue, I pick categories that plausibly relate to users' purchasing power. I will examine how users of Apple devices and users from the United States differ in sales and revenue, as these two groups plausibly have relatively high purchasing power: Apple devices are in general more expensive than other devices, and the United States has the highest GDP in the world.
Based on the case description and the data inspection, I form the following research questions and hypotheses:
RQ1: To what extent have the new campaigns (CPC and referral) increased revenue and sales compared to other traffic sources to the website?
Hypotheses for sales:
Hypotheses for revenue:
RQ2: To what extent do the CPC and referral campaigns differ in the total revenues they bring?
RQ3: To what extent do the CPC and referral campaigns differ in the sales (the purchase behavior) among users?
RQ4: To what extent does the ownership of Apple devices have an influence on revenue and sales?
RQ5: To what extent does being located in the United States have an influence on revenue and sales?
The needed IVs for this exercise are:
The needed DVs for this exercise are:
A few variables for the controls: Apple devices, and United States
#only select the columns I need
visits = visits[['referral', 'cpc', 'order_euros', 'device_operatingSystem', 'geoNetwork_country']]
visits.head()
| | referral | cpc | order_euros | device_operatingSystem | geoNetwork_country |
|---|---|---|---|---|---|
| 36548 | 0 | 0 | 128.0 | Macintosh | United States |
| 26376 | 0 | 1 | NaN | Android | United States |
| 21940 | 0 | 0 | 154.0 | Macintosh | Mexico |
| 8907 | 0 | 0 | NaN | Android | Italy |
| 36441 | 0 | 0 | NaN | Linux | United States |
visits.dtypes
referral int64
cpc int64
order_euros float64
device_operatingSystem object
geoNetwork_country object
dtype: object
visits.isna().sum()
referral 0
cpc 0
order_euros 2390
device_operatingSystem 0
geoNetwork_country 0
dtype: int64
visits['order_euros'].describe()
count 2841.000000
mean 403.511088
std 312.561567
min 1.000000
25% 124.000000
50% 322.000000
75% 679.000000
max 999.000000
Name: order_euros, dtype: float64
From the result I see that the column order_euros only contains values for visits that ended in a purchase, because its minimum is 1. Since the column also has 2390 missing values, these missing values most likely represent visits without a purchase and should be filled with 0.
#fill the missing value in 'order_euros'
visits['order_euros'] = visits['order_euros'].fillna(0)
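The reasoning behind filling the missing values can be checked on a small example. This is a minimal sketch on an invented toy Series (the values are not taken from the dataset): if no recorded order equals 0, the missing entries are the only candidates for "no purchase", and `fillna(0)` turns them into zero revenue.

```python
import numpy as np
import pandas as pd

# Toy stand-in for visits['order_euros']; the values are invented for illustration.
orders = pd.Series([128.0, np.nan, 154.0, np.nan], name="order_euros")

# If no recorded order is 0, a zero cannot already mean "no purchase".
has_zero_orders = bool((orders == 0).any())

# Count the missing entries before filling them.
n_missing = int(orders.isna().sum())

# Treat missing revenue as zero revenue.
filled = orders.fillna(0)

print(has_zero_orders, n_missing, filled.tolist())
```

This only supports the interpretation; it does not prove that every missing value is a visit without a purchase.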
Then I can create the needed variables according to the list above.
#create a variable for other campaigns
def generate_other(row):
    if row['cpc'] == 1:
        row['other'] = 0
    elif row['referral'] == 1:
        row['other'] = 0
    else:
        row['other'] = 1
    return row
visits = visits.apply(generate_other, axis=1)
#create a variable indicating campaign categories
def generate_category(row):
    if row['cpc'] == 1:
        row['cat_campaign'] = 'cpc'
    if row['referral'] == 1:
        row['cat_campaign'] = 'referral'
    if row['other'] == 1:
        row['cat_campaign'] = 'other'
    return row
visits = visits.apply(generate_category, axis=1)
#create a variable for sales (purchase behavior)
def generate_sales(row):
    if row['order_euros'] == 0:
        row['sales'] = 0
    else:
        row['sales'] = 1
    return row
visits = visits.apply(generate_sales, axis=1)
#change the name of the 'order_euros' column to revenue
visits.rename(columns={'order_euros':'revenue'}, inplace=True)
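As a side note, the row-wise `apply` calls used for `other`, `cat_campaign` and `sales` can also be written as vectorized column operations, which tend to be faster on large datasets. The sketch below uses an invented toy DataFrame rather than the real `visits` data, and it assumes that no visit is flagged as both CPC and referral (with `np.select`, the first matching condition wins).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the visits sample; the values are invented for illustration.
toy = pd.DataFrame({
    "cpc": [1, 0, 0, 0],
    "referral": [0, 1, 0, 0],
    "order_euros": [0.0, 128.0, 0.0, 154.0],
})

# 1 when the visit came from neither campaign, 0 otherwise.
toy["other"] = ((toy["cpc"] == 0) & (toy["referral"] == 0)).astype(int)

# Label each visit; conditions are checked in order and the default is 'other'.
toy["cat_campaign"] = np.select(
    [toy["cpc"] == 1, toy["referral"] == 1],
    ["cpc", "referral"],
    default="other",
)

# 1 when any revenue was recorded, 0 otherwise.
toy["sales"] = (toy["order_euros"] > 0).astype(int)

print(toy)
```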
#create a variable for Apple devices (Macintosh or iOS)
def wordlist_any_present(text, query):
    import re
    text = str(text).lower()
    newquery = []
    for word in query:
        newquery.append(str(word).lower())
    tokens = re.findall(r"[\w']+|[.,!?;$@#]", text)
    for word in newquery:
        if word in tokens:
            return 1
    return 0
visits['apple_device'] = visits['device_operatingSystem'].apply(wordlist_any_present, args=(['Macintosh', 'iOS'],))
#create a variable for United States
def wordlist_present(text, query):
    import re
    text = str(text).lower()
    newquery = []
    for word in query:
        newquery.append(str(word).lower())
    tokens = re.findall(r"[\w']+|[.,!?;$@#]", text)
    if set(newquery).issubset(tokens):
        return 1
    return 0
visits['country_US'] = visits['geoNetwork_country'].apply(wordlist_present, args=(['United', 'States', ],))
visits.head()
| | referral | cpc | revenue | device_operatingSystem | geoNetwork_country | other | cat_campaign | sales | apple_device | country_US |
|---|---|---|---|---|---|---|---|---|---|---|
| 36548 | 0 | 0 | 128.0 | Macintosh | United States | 1 | other | 1 | 1 | 1 |
| 26376 | 0 | 1 | 0.0 | Android | United States | 0 | cpc | 0 | 0 | 1 |
| 21940 | 0 | 0 | 154.0 | Macintosh | Mexico | 1 | other | 1 | 1 | 0 |
| 8907 | 0 | 0 | 0.0 | Android | Italy | 1 | other | 0 | 0 | 0 |
| 36441 | 0 | 0 | 0.0 | Linux | United States | 1 | other | 0 | 0 | 1 |
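Because the operating system and country columns hold a fixed set of labels, the two dummy variables could also be built with exact matching instead of token matching. This is a minimal sketch on an invented toy DataFrame; it assumes the labels are spelled exactly as in the `value_counts()` output ('Macintosh', 'iOS', 'United States').

```python
import pandas as pd

# Toy stand-in for the two categorical columns; the rows are invented for illustration.
toy = pd.DataFrame({
    "device_operatingSystem": ["Macintosh", "Android", "iOS", "Chrome OS"],
    "geoNetwork_country": ["United States", "United Kingdom", "Mexico", "United States"],
})

# Operating systems counted as Apple devices in this analysis.
APPLE_SYSTEMS = ["Macintosh", "iOS"]

# isin() checks each value against the list; astype(int) turns True/False into 1/0.
toy["apple_device"] = toy["device_operatingSystem"].isin(APPLE_SYSTEMS).astype(int)

# An exact comparison avoids matching other countries that contain 'United'.
toy["country_US"] = (toy["geoNetwork_country"] == "United States").astype(int)

print(toy)
```

Exact matching is stricter than the token-based functions: a label with extra words or different spelling would get 0, so it is worth checking the distinct values first.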
visits['cat_campaign'].value_counts()
other 3360
cpc 1326
referral 545
Name: cat_campaign, dtype: int64
sns.countplot(x='cat_campaign', data=visits)
<AxesSubplot:xlabel='cat_campaign', ylabel='count'>
sns.countplot(x='referral', data=visits)
<AxesSubplot:xlabel='referral', ylabel='count'>
sns.countplot(x='cpc', data=visits)
<AxesSubplot:xlabel='cpc', ylabel='count'>
sns.countplot(x='other', data=visits)
<AxesSubplot:xlabel='other', ylabel='count'>
In this sample, 545 visits come from the referral campaign, 1326 from the CPC campaign, and 3360 from other traffic sources. Visits from other sources far outnumber those from the CPC and referral campaigns (the referral group is especially small), which means the sample is not balanced across the three groups.
The right thing to do here would be to draw another sample from the visits dataset with a more balanced number of visits across the three groups, or to include more cases in the sample, but I will continue because the random state is the same for everyone.
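To make the imbalance concrete, the group counts reported above can be turned into shares of the sample. This is a small arithmetic sketch that hard-codes the three counts from the `value_counts()` output.

```python
# Group sizes copied from the value_counts() output for cat_campaign.
counts = {"other": 3360, "cpc": 1326, "referral": 545}

total = sum(counts.values())

# Share of the sample that each group represents.
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# other: 64.2%
# cpc: 25.3%
# referral: 10.4%
```

On the real DataFrame, `visits['cat_campaign'].value_counts(normalize=True)` gives the same shares directly.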
visits['sales'].value_counts()
1 2841
0 2390
Name: sales, dtype: int64
sns.countplot(x='sales', data=visits)
<AxesSubplot:xlabel='sales', ylabel='count'>
visits['revenue'].describe()
count 5231.000000
mean 219.150258
std 305.713499
min 0.000000
25% 0.000000
50% 38.000000
75% 363.000000
max 999.000000
Name: revenue, dtype: float64
sns.distplot(visits['revenue'])
/opt/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='revenue', ylabel='Density'>
From the result I find that more visits ended in a purchase than not: 2841 visits included a purchase and 2390 did not. The largest revenue generated was €999.
I have five pairs of hypotheses:
H1a. Users entering the website via the CPC campaign will be more likely to make a purchase compared to users from other traffic sources.
H1b. Users entering the website via the referral campaign will be more likely to make a purchase compared to users from other traffic sources.
H2a. Users entering the website via the CPC campaign will have more expensive orders compared to users entering from other traffic sources.
H2b. Users entering the website via the referral campaign will have more expensive orders compared to users entering from other traffic sources.
H3a. Users entering the website via the CPC campaign will be more likely to make a purchase compared to users entering the website via the referral campaign.
H3b. Users entering the website via the CPC campaign will have more expensive orders compared to users entering the website via the referral campaign.
H4a. Users who have Apple devices will be more likely to make a purchase compared to users who have other kinds of devices.
H4b. Users who have Apple devices will have more expensive orders compared to users who have other kinds of devices.
H5a. Users from United States will be more likely to make a purchase compared to users from other countries.
H5b. Users from the United States will have more expensive orders compared to users from other countries.
#visualize H1a
sns.barplot(x='cpc', y='sales', data=visits)
<AxesSubplot:xlabel='cpc', ylabel='sales'>
From the visualization I find that users entering the website via the CPC campaign were more likely to make a purchase compared to users from other traffic sources. The difference is statistically significant according to the confidence interval. From the visualization, H1a is confirmed by the past data.
#visualize H1b
sns.barplot(x='referral', y='sales', data=visits)
<AxesSubplot:xlabel='referral', ylabel='sales'>
Users entering the website via the referral campaign were slightly less likely to make a purchase compared to users from other traffic sources, but the difference does not appear to be statistically significant. From the visualization, H1b is not supported by the past data.
#visualize H2a
sns.barplot(x='cpc', y='revenue', data=visits)
<AxesSubplot:xlabel='cpc', ylabel='revenue'>
Users entering the website via the CPC campaign had more expensive orders compared to users entering from other traffic sources. The difference is statistically significant. From the visualization, H2a is confirmed by the past data.
#visualize H2b
sns.barplot(x='referral', y='revenue', data=visits)
<AxesSubplot:xlabel='referral', ylabel='revenue'>
Although they were less likely to make a purchase, users entering the website via the referral campaign had more expensive orders compared to users entering from other traffic sources. The difference is statistically significant. From the visualization, H2b is confirmed by the past data.
#visualize H3a
sns.barplot(x='cat_campaign', y='sales', data=visits)
<AxesSubplot:xlabel='cat_campaign', ylabel='sales'>
Users entering the website via the CPC campaign were more likely to make a purchase compared to users entering the website via the referral campaign. The difference is statistically significant. From the visualization, H3a is confirmed by the past data.
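Reading significance off the error bars is only a visual check. As a complement, the CPC versus referral comparison can be sketched as a two-proportion z-test. The function name below is hypothetical, and the purchase counts are an assumption derived from the group sizes and purchase rates in this sample (1326 × 0.686 ≈ 910 for CPC, 545 × 0.490 ≈ 267 for referral).

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one purchase rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts: purchases and visits for the CPC and referral groups.
z, p_value = two_proportion_ztest(910, 1326, 267, 545)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

A large positive z with a very small p-value would agree with the reading of the bar plot; the regression models later in the notebook test the same difference while controlling for device and location.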
#visualize H3b
sns.barplot(x='cat_campaign', y='revenue', data=visits)
<AxesSubplot:xlabel='cat_campaign', ylabel='revenue'>
Users entering the website via the CPC campaign had more expensive orders compared to users entering the website via the referral campaign, but the difference is not statistically significant. From the visualization, H3b is not supported by the past data.
#visualize H4a
sns.barplot(x='apple_device', y='sales', data=visits)
<AxesSubplot:xlabel='apple_device', ylabel='sales'>
Users who have Apple devices were less likely to make a purchase compared to users who have other kinds of devices. The difference is not statistically significant. From the visualization, H4a is not supported by the past data.
#visualize H4b
sns.barplot(x='apple_device', y='revenue', data=visits)
<AxesSubplot:xlabel='apple_device', ylabel='revenue'>
Users who have Apple devices had less expensive orders compared to users who have other kinds of devices. The difference is statistically significant. From the visualization, H4b is rejected by the past data.
#visualize H5a
sns.barplot(x='country_US', y='sales', data=visits)
<AxesSubplot:xlabel='country_US', ylabel='sales'>
Users from United States were more likely to make a purchase compared to users from other countries. The difference is not statistically significant. From the visualization, H5a is not supported by the past data.
#visualize H5b
sns.barplot(x='country_US', y='revenue', data=visits)
<AxesSubplot:xlabel='country_US', ylabel='revenue'>
Users from the United States had more expensive orders compared to users from other countries. The difference is statistically significant. From the visualization, H5b is confirmed by the past data.
visits[['cat_campaign', 'sales', 'revenue']].groupby(['cat_campaign']).describe().transpose()
| | cat_campaign | cpc | other | referral |
|---|---|---|---|---|
| sales | count | 1326.000000 | 3360.000000 | 545.000000 |
| | mean | 0.686275 | 0.495238 | 0.489908 |
| | std | 0.464181 | 0.500052 | 0.500357 |
| | min | 0.000000 | 0.000000 | 0.000000 |
| | 25% | 0.000000 | 0.000000 | 0.000000 |
| | 50% | 1.000000 | 0.000000 | 0.000000 |
| | 75% | 1.000000 | 1.000000 | 1.000000 |
| | max | 1.000000 | 1.000000 | 1.000000 |
| revenue | count | 1326.000000 | 3360.000000 | 545.000000 |
| | mean | 453.440422 | 97.737202 | 397.644037 |
| | std | 362.823902 | 151.468950 | 423.634404 |
| | min | 0.000000 | 0.000000 | 0.000000 |
| | 25% | 0.000000 | 0.000000 | 0.000000 |
| | 50% | 515.500000 | 0.000000 | 0.000000 |
| | 75% | 786.250000 | 155.000000 | 850.000000 |
| | max | 999.000000 | 876.000000 | 998.000000 |
From the result of the groupby chart, I find that:
For this challenge, we would like you to focus on sales (binary variable) as the DV.
You need to test the hypotheses and make predictions for each campaign using ML. In other words, you need to:
Don't forget to interpret the results in MarkDown, and indicate whether your hypotheses were supported, not supported (or even rejected).
Because sales is a binary variable, for this exercise, I use logistic regression.
My hypotheses related to sales that will be tested by the model are:
#use logistic regression to create statistical model
logit_sales = sm.Logit(visits['sales'], sm.add_constant(visits[['cpc', 'referral','apple_device', 'country_US']]))
result_sales = logit_sales.fit()
print(result_sales.summary())
Optimization terminated successfully.
Current function value: 0.675049
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: sales No. Observations: 5231
Model: Logit Df Residuals: 5226
Method: MLE Df Model: 4
Date: Wed, 03 Mar 2021 Pseudo R-squ.: 0.02085
Time: 10:36:30 Log-Likelihood: -3531.2
converged: True LL-Null: -3606.4
Covariance Type: nonrobust LLR p-value: 1.662e-31
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
const -0.0266 0.048 -0.550 0.582 -0.121 0.068
cpc 0.7907 0.073 10.854 0.000 0.648 0.933
referral -0.0153 0.094 -0.163 0.870 -0.199 0.168
apple_device -0.0133 0.062 -0.214 0.831 -0.135 0.108
country_US 0.0296 0.060 0.496 0.620 -0.087 0.146
================================================================================
H1a: The b coefficient for the CPC campaign is positive and statistically significant (b = 0.79, p < .001): users entering the website via the CPC campaign are, on average, more likely to make a purchase than users from other traffic sources.
H1b: The coefficient for the referral campaign is negative (b = -0.02), which would mean that users entering the website via the referral campaign are less likely to purchase something than users from other traffic sources. However, the p value (p = .870) is larger than .05, so the effect is not statistically significant.
H4a: Having an Apple device negatively predicts purchase behavior, but the difference between users of Apple devices and users of other devices is not statistically significant (p = .831).
H5a: The partial effect of being located in the US is positive but not statistically significant (p = .620).
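Logit coefficients are on the log-odds scale, so their size is easier to read after exponentiating them into odds ratios or converting a linear predictor into a probability. The sketch below copies the coefficients from the regression summary above; the helper `sigmoid` is introduced here only for illustration.

```python
import math

# Coefficients copied from the Logit summary (reference group: other traffic sources).
coefs = {
    "const": -0.0266,
    "cpc": 0.7907,
    "referral": -0.0153,
    "apple_device": -0.0133,
    "country_US": 0.0296,
}

def sigmoid(x):
    """Convert log-odds into a probability."""
    return 1 / (1 + math.exp(-x))

# exp(b) is the multiplicative change in the odds of a purchase.
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "const"}

# Predicted purchase probability with all dummies at 0, and with cpc = 1.
p_baseline = sigmoid(coefs["const"])
p_cpc = sigmoid(coefs["const"] + coefs["cpc"])

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
print(f"P(purchase | baseline) = {p_baseline:.3f}")
print(f"P(purchase | cpc)      = {p_cpc:.3f}")
```

An odds ratio above 1 means higher odds of a purchase than in the reference group; ratios close to 1, as for the non-significant predictors, indicate almost no difference.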
#use logistic regression to create statistical model for H3a
logit_sales2 = sm.Logit(visits['sales'], sm.add_constant(visits[['cpc', 'other', 'apple_device', 'country_US']]))
result_sales2 = logit_sales2.fit()
print(result_sales2.summary())
Optimization terminated successfully.
Current function value: 0.675049
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: sales No. Observations: 5231
Model: Logit Df Residuals: 5226
Method: MLE Df Model: 4
Date: Wed, 03 Mar 2021 Pseudo R-squ.: 0.02085
Time: 10:36:30 Log-Likelihood: -3531.2
converged: True LL-Null: -3606.4
Covariance Type: nonrobust LLR p-value: 1.662e-31
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
const -0.0419 0.088 -0.474 0.635 -0.215 0.131
cpc 0.8060 0.109 7.368 0.000 0.592 1.020
other 0.0153 0.094 0.163 0.870 -0.168 0.199
apple_device -0.0133 0.062 -0.214 0.831 -0.135 0.108
country_US 0.0296 0.060 0.496 0.620 -0.087 0.146
================================================================================
H3a: The b coefficient for the CPC campaign is positive (b = 0.81, p < .001): with the referral campaign as the reference category, users entering the website via the CPC campaign are, on average, more likely to make a purchase than users entering the website via the referral campaign.
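The two models are the same model with a different reference category, which is why their fit statistics are identical. A short arithmetic check, using only the coefficients printed in the two summaries, shows how the second set of estimates follows from the first.

```python
# Coefficients copied from the two Logit summaries.
# Model 1 leaves out 'other', so other traffic sources are the reference group.
model_ref_other = {"const": -0.0266, "cpc": 0.7907, "referral": -0.0153}
# Model 2 leaves out 'referral', so the referral campaign is the reference group.
model_ref_referral = {"const": -0.0419, "cpc": 0.8060, "other": 0.0153}

# CPC relative to referral = (CPC vs other) - (referral vs other).
cpc_vs_referral = model_ref_other["cpc"] - model_ref_other["referral"]

# The log-odds for the referral group become the new intercept.
referral_log_odds = model_ref_other["const"] + model_ref_other["referral"]

# 'other' relative to referral is the referral coefficient with the sign flipped.
other_vs_referral = -model_ref_other["referral"]

print(round(cpc_vs_referral, 4), model_ref_referral["cpc"])
print(round(referral_log_odds, 4), model_ref_referral["const"])
print(round(other_vs_referral, 4), model_ref_referral["other"])
```

The standard errors do change between the two models, because the referral group is much smaller than the group of other traffic sources.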
| Hypothesis | Result |
|---|---|
| H1a | confirmed |
| H1b | not supported |
| H3a | confirmed |
| H4a | not supported |
| H5a | not supported |
logit_visits = LogisticRegression(max_iter=1000, fit_intercept = True)
logit_visits.fit(visits[['cpc', 'referral', 'other', 'apple_device', 'country_US']], visits['sales'])
LogisticRegression(max_iter=1000)
logit_visits.predict_proba([[1,0,0,0,0]])
array([[0.31824608, 0.68175392]])
For someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States, the model predicts a 31.8% probability of no purchase and a 68.2% probability of a purchase.
logit_visits.predict_proba([[1,0,0,1,0]])
array([[0.32124293, 0.67875707]])
For someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States, the model predicts a 32.1% probability of no purchase and a 67.9% probability of a purchase.
logit_visits.predict_proba([[1,0,0,1,1]])
array([[0.31468736, 0.68531264]])
For someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States, the model predicts a 31.5% probability of no purchase and a 68.5% probability of a purchase.
logit_visits.predict_proba([[1,0,0,0,1]])
array([[0.31172358, 0.68827642]])
For someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States, the model predicts a 31.2% probability of no purchase and a 68.8% probability of a purchase.
logit_visits.predict_proba([[0,1,0,0,0]])
array([[0.50995777, 0.49004223]])
For someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States, the model predicts a 51.0% probability of no purchase and a 49.0% probability of a purchase.
logit_visits.predict_proba([[0,1,0,1,0]])
array([[0.51340043, 0.48659957]])
For someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States, the model predicts a 51.3% probability of no purchase and a 48.7% probability of a purchase.
logit_visits.predict_proba([[0,1,0,1,1]])
array([[0.50584591, 0.49415409]])
For someone who comes to the website via the referral campaign, uses an Apple device, and is from the United States, the model predicts a 50.6% probability of no purchase and a 49.4% probability of a purchase.
logit_visits.predict_proba([[0,1,0,0,1]])
array([[0.50240161, 0.49759839]])
For someone who comes to the website via the referral campaign, does not use an Apple device, and is from the United States, the model predicts a 50.2% probability of no purchase and a 49.8% probability of a purchase.
logit_visits.predict_proba([[0,0,1,0,0]])
array([[0.50658529, 0.49341471]])
For someone who comes to the website via other campaigns, does not use an Apple device, and is not from the United States, the model predicts a 50.7% probability of no purchase and a 49.3% probability of a purchase.
logit_visits.predict_proba([[0,0,1,1,0]])
array([[0.51002888, 0.48997112]])
For someone who comes to the website via other campaigns, uses an Apple device, and is not from the United States, the model predicts a 51.0% probability of no purchase and a 49.0% probability of a purchase.
logit_visits.predict_proba([[0,0,1,1,1]])
array([[0.50247274, 0.49752726]])
For someone who comes to the website via other campaigns, uses an Apple device, and is from the United States, the model predicts a 50.2% probability of no purchase and a 49.8% probability of a purchase.
logit_visits.predict_proba([[0,0,1,0,1]])
array([[0.49902821, 0.50097179]])
For someone who comes to the website via other campaigns, does not use an Apple device, and is from the United States, the model predicts a 49.9% probability of no purchase and a 50.1% probability of a purchase.
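The twelve predictions above cover every combination of campaign, device and location, so the profiles can also be generated in a loop instead of typed one by one. In the sketch below, `build_profiles` and `standin_predict_proba` are hypothetical names: the stand-in imitates `predict_proba` using the statsmodels coefficients reported earlier, so its numbers will differ slightly from the scikit-learn model (which is regularized and includes all three campaign dummies).

```python
from itertools import product
import math

# Column order used when fitting: cpc, referral, other, apple_device, country_US.
CAMPAIGNS = {"cpc": [1, 0, 0], "referral": [0, 1, 0], "other": [0, 0, 1]}

def build_profiles():
    """Return (campaign, apple, us, feature_row) for all 3 x 2 x 2 combinations."""
    profiles = []
    for (name, dummies), apple, us in product(CAMPAIGNS.items(), [0, 1], [0, 1]):
        profiles.append((name, apple, us, dummies + [apple, us]))
    return profiles

def standin_predict_proba(rows):
    """Hypothetical stand-in for a fitted model's predict_proba.

    Uses the Logit coefficients from the summary (reference: other), so the
    'other' dummy gets a coefficient of 0.
    """
    const = -0.0266
    coefs = [0.7907, -0.0153, 0.0, -0.0133, 0.0296]
    output = []
    for row in rows:
        log_odds = const + sum(b * x for b, x in zip(coefs, row))
        p_buy = 1 / (1 + math.exp(-log_odds))
        output.append([1 - p_buy, p_buy])
    return output

profiles = build_profiles()
probs = standin_predict_proba([profile[3] for profile in profiles])

for (name, apple, us, _), (_, p_buy) in zip(profiles, probs):
    print(f"{name:9s} apple={apple} us={us} P(purchase)={p_buy:.3f}")
```

With the fitted scikit-learn model, the same list of rows could be passed to `logit_visits.predict_proba` in a single call.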
lime_sales = visits[['cpc', 'referral', 'other', 'apple_device', 'country_US', 'sales']]
class_names_sales = lime_sales.columns
#the five predictors only; 'sales' is the DV and should not be listed as a feature name
feature_names_sales = ['cpc', 'referral', 'other', 'apple_device', 'country_US']
X_lime_sales = lime_sales[feature_names_sales].to_numpy()
y_lime_sales = lime_sales['sales'].to_numpy()
explainer_logit = lime.lime_tabular.LimeTabularExplainer(
    X_lime_sales,
    feature_names = feature_names_sales,
    verbose = True,
    mode = 'classification')
#for cpc campaign
sales_cpc1 = explainer_logit.explain_instance(np.array([1,0,0,0,0]), logit_visits.predict_proba)
sales_cpc1.show_in_notebook(show_table=True)
Intercept 0.4318921498024505
Prediction_local [0.68416338]
Right: 0.681753918386001
Someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States has a 32% probability of not purchasing anything and a 68% probability of purchasing something. The CPC campaign makes a purchase more likely, while the location makes it slightly less likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase decision.
sales_cpc2 = explainer_logit.explain_instance(np.array([1,0,0,1,0]), logit_visits.predict_proba)
sales_cpc2.show_in_notebook(show_table=True)
Intercept 0.4352989507049099
Prediction_local [0.68071275]
Right: 0.6787570656250793
Someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States has a 32% probability of not purchasing anything and a 68% probability of purchasing something. The CPC campaign makes a purchase more likely, while the location makes it slightly less likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase decision.
sales_cpc3 = explainer_logit.explain_instance(np.array([1,0,0,1,1]), logit_visits.predict_proba)
sales_cpc3.show_in_notebook(show_table=True)
Intercept 0.4278031057774712
Prediction_local [0.68826546]
Right: 0.6853126367936688
Someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States has a 31% probability of not purchasing anything and a 69% probability of purchasing something. The CPC campaign and the US location both make a purchase more likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase decision.
sales_cpc4 = explainer_logit.explain_instance(np.array([1,0,0,0,1]), logit_visits.predict_proba)
sales_cpc4.show_in_notebook(show_table=True)
Intercept 0.42479625714558295
Prediction_local [0.69144525]
Right: 0.6882764195936123
Someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States has a 31% probability of not purchasing anything and a 69% probability of purchasing something. The CPC campaign and the US location both make a purchase more likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase decision.
#for referral campaign
sales_referral1 = explainer_logit.explain_instance(np.array([0,1,0,0,0]), logit_visits.predict_proba)
sales_referral1.show_in_notebook(show_table=True)
Intercept 0.625521395551573
Prediction_local [0.49026952]
Right: 0.4900422298898613
Someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States has a 51% probability of not purchasing anything and a 49% probability of purchasing something. The campaign and the location both make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase decision.
sales_referral2 = explainer_logit.explain_instance(np.array([0,1,0,1,0]), logit_visits.predict_proba)
sales_referral2.show_in_notebook(show_table=True)
Intercept 0.6288320122543399
Prediction_local [0.48711105]
Right: 0.48659957163700174
Someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States has a 51% probability of not purchasing anything and a 49% probability of purchasing something. The campaign and the location both make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase decision.
sales_referral3 = explainer_logit.explain_instance(np.array([0,1,0,1,1]), logit_visits.predict_proba)
sales_referral3.show_in_notebook(show_table=True)
Intercept 0.6214690785728023 Prediction_local [0.49449619] Right: 0.4941540909182348
For someone who came to the website via the referral campaign, is using an Apple device and is from America, the model predicts a 51% probability of not purchasing anything and a 49% probability of purchasing something. The campaign makes a purchase less likely, while the location makes it slightly more likely (this is not a large effect). In this case, using an Apple device does not influence the purchase decision.
sales_referral4 = explainer_logit.explain_instance(np.array([0,1,0,0,1]), logit_visits.predict_proba)
sales_referral4.show_in_notebook(show_table=True)
Intercept 0.6180685146591681 Prediction_local [0.49753879] Right: 0.49759839424643804
For someone who came to the website via the referral campaign, is not using an Apple device and is from America, the model predicts a 50% probability of not purchasing anything and a 50% probability of purchasing something. The campaign makes a purchase less likely, while the location makes it slightly more likely (this is not a large effect). In this case, using an Apple device does not influence the purchase decision.
#for other campaigns
sales_other1 = explainer_logit.explain_instance(np.array([0,0,1,0,0]), logit_visits.predict_proba)
sales_other1.show_in_notebook(show_table=True)
Intercept 0.6222432229663373 Prediction_local [0.49385938] Right: 0.4934147066176886
For someone who came to the website via other campaigns, is not using an Apple device and is not from America, the model predicts a 51% probability of not purchasing anything and a 49% probability of purchasing something. The campaign and the location both make a purchase less likely, although the effect of location is small. In this case, using an Apple device does not influence the purchase decision.
sales_other2 = explainer_logit.explain_instance(np.array([0,0,1,1,0]), logit_visits.predict_proba)
sales_other2.show_in_notebook(show_table=True)
Intercept 0.6258053293908361 Prediction_local [0.49042369] Right: 0.48997111979371866
For someone who came to the website via other campaigns, is using an Apple device and is not from America, the model predicts a 51% probability of not purchasing anything and a 49% probability of purchasing something. The campaign and the location both make a purchase less likely, although the effect of location is small. In this case, using an Apple device does not influence the purchase decision.
sales_other3 = explainer_logit.explain_instance(np.array([0,0,1,1,1]), logit_visits.predict_proba)
sales_other3.show_in_notebook(show_table=True)
Intercept 0.6180373100242279 Prediction_local [0.49789457] Right: 0.4975272574230635
For someone who came to the website via other campaigns, is using an Apple device and is from America, the model predicts a 50% probability of not purchasing anything and a 50% probability of purchasing something. The campaign makes a purchase less likely, while the location makes it slightly more likely (this is not a large effect). In this case, using an Apple device does not influence the purchase decision.
sales_other4 = explainer_logit.explain_instance(np.array([0,0,1,0,1]), logit_visits.predict_proba)
sales_other4.show_in_notebook(show_table=True)
Intercept 0.6147617425051557 Prediction_local [0.50124615] Right: 0.5009717872964901
For someone who came to the website via other campaigns, is not using an Apple device and is from America, the model predicts a 50% probability of not purchasing anything and a 50% probability of purchasing something. The campaign makes a purchase less likely, while the location makes it slightly more likely (this is not a large effect). In this case, using an Apple device does not influence the purchase decision.
Comparing these explanations, I find that device type and location have far less influence on the predicted purchase behavior than the campaign does.
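The twelve profiles explained above (three campaigns, Apple device or not, US or not) were typed out by hand. As a minimal sketch, they could also be generated in a loop. The `campaigns` and `profiles` names are hypothetical helpers, and the column order is assumed to match the order used in this notebook.

```python
from itertools import product

import numpy as np

# Assumed column order: cpc, referral, other, apple_device, country_US
campaigns = {
    "cpc": (1, 0, 0),
    "referral": (0, 1, 0),
    "other": (0, 0, 1),
}

# One row per combination of campaign, device and location
profiles = {}
for (name, dummies), apple, us in product(campaigns.items(), (0, 1), (0, 1)):
    label = f"{name}, apple={apple}, US={us}"
    profiles[label] = np.array([*dummies, apple, us])

for label, row in profiles.items():
    print(label, row)
    # With the fitted objects from this notebook, each row could be explained via:
    # explainer_logit.explain_instance(row, logit_visits.predict_proba)
```

Generating the rows this way makes it harder to mistype a profile and keeps the campaign dummies mutually exclusive.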
For this challenge, we would like you to focus on revenue (continuous variable) as the DV.
You need to test the hypotheses and make predictions for each campaign using ML. In other words, you need to:
Don't forget to interpret the results in MarkDown, and indicate whether your hypotheses were supported, not supported (or even rejected).
Because revenue is a continuous variable, for this exercise, I use linear regression.
My hypotheses related to revenue, which the model will test, are:
#use linear regression to create statistical model
ols_sales = sm.OLS(visits['revenue'], sm.add_constant(visits[['cpc', 'referral','apple_device', 'country_US']]))
result_sales = ols_sales.fit()
print(result_sales.summary())
OLS Regression Results
==============================================================================
Dep. Variable: revenue R-squared: 0.286
Model: OLS Adj. R-squared: 0.286
Method: Least Squares F-statistic: 524.1
Date: Wed, 03 Mar 2021 Prob (F-statistic): 0.00
Time: 10:36:39 Log-Likelihood: -36475.
No. Observations: 5231 AIC: 7.296e+04
Df Residuals: 5226 BIC: 7.299e+04
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 94.5555 6.202 15.246 0.000 82.397 106.714
cpc 350.0847 8.959 39.078 0.000 332.522 367.647
referral 302.6920 12.083 25.052 0.000 279.005 326.379
apple_device -7.3142 7.929 -0.923 0.356 -22.858 8.229
country_US 14.1142 7.569 1.865 0.062 -0.725 28.954
==============================================================================
Omnibus: 73.804 Durbin-Watson: 2.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 75.070
Skew: 0.280 Prob(JB): 5.00e-17
Kurtosis: 2.827 Cond. No. 4.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
H2a: The b coefficient for the CPC campaign is positive (b = 350.08, p < .001): holding device and location constant, users who enter the website via the CPC campaign spend on average about €350 more than users from other traffic sources.
H2b: The referral campaign also has a positive effect on revenue (b = 302.69, p < .001): users who enter the website via the referral campaign spend on average about €303 more than users from other traffic sources.
H4b: Using an Apple device negatively predicts revenue (b = -7.31), but the difference between users of Apple devices and users of other devices is not significant (p = .356).
H5b: The partial effect of being located in the US is positive (b = 14.11) but not statistically significant (p = .062).
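As a worked example of how these coefficients combine, the sketch below adds the intercept and the relevant dummy coefficients by hand. The numbers are copied from the OLS summary above; `expected_revenue` is a hypothetical helper, not a statsmodels function.

```python
# Coefficients copied from the OLS summary above
coef = {
    "const": 94.5555,
    "cpc": 350.0847,
    "referral": 302.6920,
    "apple_device": -7.3142,
    "country_US": 14.1142,
}


def expected_revenue(cpc=0, referral=0, apple_device=0, country_US=0):
    """Linear prediction: intercept plus each coefficient times its dummy."""
    return (
        coef["const"]
        + coef["cpc"] * cpc
        + coef["referral"] * referral
        + coef["apple_device"] * apple_device
        + coef["country_US"] * country_US
    )


# Baseline: other traffic sources, no Apple device, not in the US
print(round(expected_revenue(), 2))
# CPC visitor with otherwise the same profile
print(round(expected_revenue(cpc=1), 2))
# Difference between the CPC and referral coefficients
print(round(coef["cpc"] - coef["referral"], 2))
```

The last line is only a descriptive difference between two coefficients; it does not test whether the two campaigns differ significantly.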
# use linear regression to create the statistical model for H3b ('referral' is left out, so it is the reference category)
ols_sales2 = sm.OLS(visits['sales'], sm.add_constant(visits[['cpc', 'other', 'apple_device', 'country_US']]))
result_sales2 = ols_sales2.fit()
print(result_sales2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.028
Model: OLS Adj. R-squared: 0.027
Method: Least Squares F-statistic: 37.78
Date: Wed, 03 Mar 2021 Prob (F-statistic): 3.30e-31
Time: 10:36:39 Log-Likelihood: -3702.5
No. Observations: 5231 AIC: 7415.
Df Residuals: 5226 BIC: 7448.
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 0.4896 0.022 22.585 0.000 0.447 0.532
cpc 0.1922 0.026 7.312 0.000 0.141 0.244
other 0.0039 0.023 0.169 0.866 -0.041 0.049
apple_device -0.0032 0.015 -0.214 0.830 -0.033 0.026
country_US 0.0071 0.014 0.497 0.620 -0.021 0.035
==============================================================================
Omnibus: 19821.597 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 779.082
Skew: -0.154 Prob(JB): 6.68e-170
Kurtosis: 1.135 Cond. No. 7.81
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
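H3b: The b coefficient for the CPC campaign is positive (b = 0.19, p < .001), with the referral campaign as the reference category. Note that the dependent variable in the model above is sales rather than revenue, so this coefficient means that users entering via the CPC campaign are about 19 percentage points more likely to purchase than users entering via the referral campaign. For revenue, the first model gives a larger coefficient for CPC (350.08) than for referral (302.69), a difference of about €47 in favor of CPC, but that model does not test whether this difference is significant. Testing H3b directly requires fitting this model with revenue as the dependent variable.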
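To illustrate how the choice of reference category turns a coefficient into a direct CPC versus referral comparison, here is a minimal sketch on synthetic data (not the Google Store data). The group means of 450, 400 and 100 are made-up assumptions, and the fit uses plain NumPy least squares instead of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 6000

# Synthetic stand-in for the visits data: 0 = cpc, 1 = referral, 2 = other
campaign = rng.integers(0, 3, size=n)
cpc = (campaign == 0).astype(float)
referral = (campaign == 1).astype(float)
other = (campaign == 2).astype(float)

# Assumed true mean revenue: cpc 450, referral 400, other 100 (plus noise)
revenue = 100 + 350 * cpc + 300 * referral + rng.normal(0, 50, size=n)

# Design matrix with constant, cpc and other: referral is the reference category
X = np.column_stack([np.ones(n), cpc, other])
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
const, b_cpc, b_other = beta

print(f"const (mean revenue for referral): {const:.1f}")
print(f"cpc   (CPC minus referral):        {b_cpc:.1f}")
print(f"other (other minus referral):      {b_other:.1f}")
```

With referral as the omitted category, the `cpc` coefficient is the revenue difference between CPC and referral visitors, which is the comparison H3b asks about.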
| Hypothesis | Result |
|---|---|
| H1a | confirmed |
| H1b | not supported |
| H2a | confirmed |
| H2b | confirmed |
| H3a | confirmed |
| H3b | confirmed |
| H4a | not supported |
| H4b | not supported |
| H5a | not supported |
| H5b | not supported |
ols_visits = LinearRegression(fit_intercept = True)
ols_visits.fit(visits[['cpc', 'referral', 'other', 'apple_device', 'country_US']], visits['revenue'])
LinearRegression()
ols_visits.predict([[1,0,0,0,0]])
array([442.5])
The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device and is not from America is €442.5.
ols_visits.predict([[1,0,0,1,0]])
array([435.375])
The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device and is not from America is €435.4.
ols_visits.predict([[1,0,0,1,1]])
array([449.625])
The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device and is from America is €449.6.
ols_visits.predict([[1,0,0,0,1]])
array([456.75])
The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device and is from America is €456.8.
ols_visits.predict([[0,1,0,0,0]])
array([395.875])
The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device and is not from America is €395.9.
ols_visits.predict([[0,1,0,1,0]])
array([388.75])
The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device and is not from America is €388.8.
ols_visits.predict([[0,1,0,1,1]])
array([403.])
The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device and is from America is €403.
ols_visits.predict([[0,1,0,0,1]])
array([410.125])
The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device and is from America is €410.1.
ols_visits.predict([[0,0,1,0,0]])
array([95.375])
The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device and is not from America is €95.4.
ols_visits.predict([[0,0,1,1,0]])
array([88.25])
The expected revenue for someone who comes to the website via other campaigns, uses an Apple device and is not from America is €88.3.
ols_visits.predict([[0,0,1,1,1]])
array([102.5])
The expected revenue for someone who comes to the website via other campaigns, uses an Apple device and is from America is €102.5.
ols_visits.predict([[0,0,1,0,1]])
array([109.625])
The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device and is from America is €109.6.
lime_revenue = visits[['cpc', 'referral', 'other', 'apple_device', 'country_US', 'revenue']]
class_names_revenue = lime_revenue.columns
X_lime_revenue = lime_revenue[['cpc', 'referral', 'other', 'apple_device', 'country_US']].to_numpy()
y_lime_revenue = lime_revenue['revenue'].to_numpy()
explainer_ols = lime.lime_tabular.LimeTabularExplainer(
    X_lime_revenue,
    # keep only the five feature columns; the last column ('revenue') is the target
    feature_names = list(class_names_revenue[:-1]),
    class_names = ['revenue'],
    verbose = True,
    mode = 'regression',
    discretize_continuous = True)
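In this section, the LIME output is garbled and the local predictions are very hard to read: the intercepts and local predictions are extremely large numbers (intercepts around -1e15), so the labels in the plot overlap. I am not certain of the cause, but a likely explanation is that cpc, referral and other are mutually exclusive and always sum to 1, so together with the intercept they are perfectly collinear in ols_visits. The fitted coefficients can then be very large and only cancel out for valid profiles, while LIME perturbs each feature independently and therefore also requests predictions for invalid combinations (for example, no campaign at all, or two campaigns at once). The "Right" values are the model's own predictions for the actual profile and remain readable, so I use them in the interpretations below.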
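One thing worth checking when a linear model returns extreme values is whether the three campaign dummies, which always sum to 1, are collinear with the intercept. The sketch below uses synthetic data (not the Google Store data) and plain NumPy. The coefficient values 95, 350 and 300 are made-up assumptions, and dropping `other` as the reference category is one possible remedy rather than a confirmed fix for this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000

# Synthetic campaign dummies: exactly one of the three is 1 for every visit
campaign = rng.integers(0, 3, size=n)
cpc = (campaign == 0).astype(float)
referral = (campaign == 1).astype(float)
other = (campaign == 2).astype(float)
revenue = 95 + 350 * cpc + 300 * referral + rng.normal(0, 50, size=n)

const = np.ones(n)
X_all = np.column_stack([const, cpc, referral, other])  # intercept + all three dummies
X_ref = np.column_stack([const, cpc, referral])         # 'other' as reference category

# cpc + referral + other equals the constant column, so X_all loses one rank
rank_all = np.linalg.matrix_rank(X_all)
rank_ref = np.linalg.matrix_rank(X_ref)
print(f"columns: {X_all.shape[1]}, rank: {rank_all}")
print(f"columns: {X_ref.shape[1]}, rank: {rank_ref}")

# The reference-coded model has a unique solution with moderate coefficients
beta_ref, *_ = np.linalg.lstsq(X_ref, revenue, rcond=None)
print(np.round(beta_ref, 1))

# Even an invalid combination (cpc and referral both 1) gets a finite prediction
print(round(float(np.array([1, 1, 1]) @ beta_ref), 1))
```

If the rank is lower than the number of columns, the coefficients are not uniquely determined, and predictions for feature combinations that never occur in the data can become arbitrary.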
#for CPC campaign
revenue_CPC1 = explainer_ols.explain_instance(np.array([1,0,0,0,0]), ols_visits.predict)
revenue_CPC1.show_in_notebook(show_table=True)
Intercept -1087531870328681.0 Prediction_local [3.62066094e+08] Right: 442.5
The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device and is not from America is €442.5.
revenue_CPC2 = explainer_ols.explain_instance(np.array([1,0,0,1,0]), ols_visits.predict)
revenue_CPC2.show_in_notebook(show_table=True)
Intercept -1087018637013450.6 Prediction_local [-1.51455307e+10] Right: 435.375
The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device and is not from America is €435.38.
revenue_CPC3 = explainer_ols.explain_instance(np.array([1,0,0,1,1]), ols_visits.predict)
revenue_CPC3.show_in_notebook(show_table=True)
Intercept -1087263987317717.8 Prediction_local [-7.32256492e+10] Right: 449.625
The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device and is from America is €449.63.
revenue_CPC4 = explainer_ols.explain_instance(np.array([1,0,0,0,1]), ols_visits.predict)
revenue_CPC4.show_in_notebook(show_table=True)
Intercept -1087763112025751.8 Prediction_local [-8.88609965e+10] Right: 456.75
The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device and is from America is €456.75.
#for referral campaign
revenue_referral1 = explainer_ols.explain_instance(np.array([0,1,0,0,0]), ols_visits.predict)
revenue_referral1.show_in_notebook(show_table=True)
Intercept -1089430610386312.6 Prediction_local [1.25448879e+12] Right: 395.875
The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device and is not from America is €395.88.
revenue_referral2 = explainer_ols.explain_instance(np.array([0,1,0,1,0]), ols_visits.predict)
revenue_referral2.show_in_notebook(show_table=True)
Intercept -1089504153344246.4 Prediction_local [1.48993201e+12] Right: 388.75
The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device and is not from America is €388.75.
revenue_referral3 = explainer_ols.explain_instance(np.array([0,1,0,1,1]), ols_visits.predict)
revenue_referral3.show_in_notebook(show_table=True)
Intercept -1089335934539476.0 Prediction_local [1.59428768e+12] Right: 403.0
The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device and is from America is €403.
revenue_referral4 = explainer_ols.explain_instance(np.array([0,1,0,0,1]), ols_visits.predict)
revenue_referral4.show_in_notebook(show_table=True)
Intercept -1089565705533471.1 Prediction_local [1.4802385e+12] Right: 410.125
The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device and is from America is €410.13.
#for other campaigns
revenue_other1 = explainer_ols.explain_instance(np.array([0,0,1,0,0]), ols_visits.predict)
revenue_other1.show_in_notebook(show_table=True)
Intercept -1087750440159968.9 Prediction_local [-2.74752821e+11] Right: 95.375
The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device and is not from America is €95.38.
revenue_other2 = explainer_ols.explain_instance(np.array([0,0,1,1,0]), ols_visits.predict)
revenue_other2.show_in_notebook(show_table=True)
Intercept -1087331544243050.2 Prediction_local [-2.37135187e+11] Right: 88.25
The expected revenue for someone who comes to the website via other campaigns, uses an Apple device and is not from America is €88.25.
revenue_other3 = explainer_ols.explain_instance(np.array([0,0,1,1,1]), ols_visits.predict)
revenue_other3.show_in_notebook(show_table=True)
Intercept -1087448463521856.6 Prediction_local [-3.18183419e+11] Right: 102.5
The expected revenue for someone who comes to the website via other campaigns, uses an Apple device and is from America is €102.5.
revenue_other4 = explainer_ols.explain_instance(np.array([0,0,1,0,1]), ols_visits.predict)
revenue_other4.show_in_notebook(show_table=True)
Intercept -1087671486422577.0 Prediction_local [-2.84339116e+11] Right: 109.625
The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device and is from America is €109.63.
As LIME is a framework still in development, we are not sure if it will work on all computers and configurations. If by any chance you get an error message when running LIME, hand in the challenge anyway (showing the error message), and we will accept it as complete.